DrugDoctor: enhancing drug recommendation in cold-start scenario via visit-level representation learning and training

Abstract Medication recommendation is a crucial application of artificial intelligence in healthcare. Current methodologies mostly depend on patient-level longitudinal representation, which utilizes the entirety of historical electronic health records for making predictions. However, they tend to overlook a few key elements: (1) The need to analyze the impact of past medications on previous conditions. (2) Similarity in patient visits is more common than similarity in the complete medical histories of patients. (3) It is difficult to accurately represent patient-level longitudinal data due to the varying numbers of visits. To our knowledge, current models face difficulties in dealing with initial patient visits (i.e. in cold-start scenarios) which are common in clinical practice. This paper introduces DrugDoctor, an innovative drug recommendation model crafted to emulate the decision-making mechanics of human doctors. Unlike previous methods, DrugDoctor explores the visit-level relationship between prescriptions and diseases while considering the impact of past prescriptions on the patient’s condition to provide more accurate recommendations. We design a plug-and-play block to effectively capture drug substructure-aware disease information and effectiveness-aware medication information, employing cross-attention and multi-head self-attention mechanisms. Furthermore, DrugDoctor adopts a fundamentally new visit-level training strategy, aligning more closely with the practices of doctors. Extensive experiments conducted on the MIMIC-III and MIMIC-IV datasets demonstrate that DrugDoctor outperforms 10 other state-of-the-art methods in terms of Jaccard, F1-score, and PRAUC. Moreover, DrugDoctor exhibits strong robustness in handling patients with varying numbers of visits and effectively tackles “cold-start” issues in medication combination recommendations.


Introduction
The increasing availability of comprehensive health data, such as electronic health records (EHRs), has paved the way for predictive models in clinical decision-making [1][2][3]. The recommendation of effective and safe medication combinations plays a crucial role in providing appropriate and effective treatment decisions for patients with multiple diseases. Extensive research efforts have been dedicated to the field of medication recommendation [4][5][6][7][8][9]. Existing methods can be divided into three types: rule-based methods, instance-based methods, and longitudinal methods.
Rule-based methods use predefined rules designed by medical experts to guide the recommendation process [10,11]. For instance, Chen et al. [11] utilized knowledge patterns to code the clinical guidelines for chronic diseases, which serve as rules for medication recommendation. However, rule-based methods heavily rely on expert knowledge and lack generalization capabilities. Instance-based methods focus primarily on the patient's current health status. LEAP [12] treats the recommendation as a sequential decision-making process. It leverages attention mechanisms to capture dependencies between medication labels in the current visit, and then automatically determines the appropriate medication combinations. Instance-based methods may suffer from lower accuracy due to the insufficient usage of historical visit data.
To overcome this limitation, longitudinal methods incorporate the patient's historical health records to obtain a more comprehensive understanding of patients' conditions. Among them, GAMENet [13], SafeDrug [14], and MoleRec [15] utilize Dual-RNN modules to learn patients' representations from their historical diagnoses and procedure data. However, they all ignore the value of historically prescribed medication information. COGNet [16] develops a novel copy-or-predict mechanism to decide whether to copy a medicine from previous recommendations or to predict a new one. COGNet demonstrates the relevance of historical medications to current prediction. More recently, SHAPE [17] investigates the relationship within the medical events using a compact intra-visit set encoder. It then employs a soft curriculum learning method and makes predictions based on the patient-level representation learned by an inter-visit longitudinal encoder.
Although previous works have shown promising results, they still face two critical limitations: (1) Challenges in the cold-start scenario: These models typically rely on patient-level representation derived from the entire historical diagnosis and procedure information of a specific patient. However, similarities among patients may only be evident in certain individual visits rather than across their complete medical histories. As a result, methods based on patient-level representation may not effectively capture the associations between diseases and drugs at the visit level. This means that they struggle to generalize known prescription results at the visit level to similar conditions of other patients. Consequently, their predictive performance and application scenarios are limited, particularly when dealing with new patients. (2) Insufficient utilization of historical prescription information: Existing methods often focus on using historical diagnosis and procedure information for historical representation, overlooking the impact of historical prescriptions on the patient's condition. However, a patient's current health condition commonly depends on the most recent treatment, that is, on the effects of previously prescribed medications.
In clinical practice, the prescription process of doctors for patients typically involves the following steps:
• Initially, the doctor evaluates the patient's health condition based on the diagnosis and procedure information of the current visit. They also consider similar cases they have encountered before (according to their professional experience) to provide an initial prescription.
• Subsequently, the doctor takes into account the patient's previous prescriptions and the effectiveness of medications in improving the patient's condition. By incorporating this information, doctors determine the most suitable prescription or combination of medications for the patient's current visit.
Motivated by this process, we developed a visit-level model named DrugDoctor to address the aforementioned limitations. DrugDoctor focused on extracting correlation information between diseases and drugs prescribed by doctors at the visit level while also investigating the impact of historical medications on the patient's condition. Specifically, transformer-based encoders were introduced to explore the intrinsic relationship within different medical information. A plug-and-play block named CA-MHSA was proposed to extract information with specific awareness, which incorporates cross-attention and multi-head self-attention mechanisms. We utilized the medication substructure information extracted by a graph neural network as a query to the CA-MHSA block to explore the associations between diseases and medications in the current visit. Additionally, we employed a recurrent neural network and another CA-MHSA block to determine the contributions of historical medications to the present prediction. Different from previous works, to better align with the real-world workflow of doctors, we trained the model in a visit-by-visit manner. The experiments conducted on several widely used datasets showed that DrugDoctor outperformed 10 state-of-the-art methods, especially in the challenging cold-start scenario.

Electronic health records
EHRs are digitally stored and managed personal health information, aiming at providing centralized access and improved decision support. Formally, the EHR for patient i can be represented as a time-based sequence $V_i = [v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(N_i)}]$, where $v_i^{(t)}$ represents the tth visit of patient i and $N_i$ is the total number of visits of patient i. We omit the patient index i to simplify the notation if there is no confusion. $v^{(t)}$ consists of $D^{(t)}$, $P^{(t)}$, and $M^{(t)}$, which are, respectively, the sets of diagnoses, procedures, and medications that appeared in the tth visit. It can be further denoted as a concatenation of multi-hot vectors $v^{(t)} = [d^{(t)}, p^{(t)}, m^{(t)}]$, where $d^{(t)} \in \{0, 1\}^{|D|}$, $p^{(t)} \in \{0, 1\}^{|P|}$, and $m^{(t)} \in \{0, 1\}^{|M|}$. D, P, and M refer to the sets of all diagnoses, procedures, and medications that appear in the data, respectively. Figure 1 describes the visit-level EHR data.
Figure 1. The EHR data of a patient consist of a sequence of visits $V^{(0)}, V^{(1)}, \ldots, V^{(t)}$; each visit contains a set of medical codes (i.e. the diagnosis code $D^{(t)}$, the procedure code $P^{(t)}$, and the medication code $M^{(t)}$); the red box and purple box denote the available information of the current visit and the last visit, respectively; the green box represents the entire historical prescriptions; pre is the predicted medication combinations at the current visit.
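As a minimal illustration of the multi-hot visit encoding described above, the following Python sketch builds $v^{(t)} = [d^{(t)}, p^{(t)}, m^{(t)}]$ for a single visit. The vocabularies, code names, and helper functions are hypothetical placeholders chosen for illustration; they are not MIMIC codes and not the authors' implementation.

```python
from typing import Dict, List, Sequence


def multi_hot(codes: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 at the index of every code in `codes`."""
    vec = [0] * len(vocab)
    for code in codes:
        vec[vocab[code]] = 1
    return vec


# Hypothetical toy vocabularies (assumption: 3 diagnoses, 2 procedures, 4 drugs).
diag_vocab = {"D1": 0, "D2": 1, "D3": 2}
proc_vocab = {"P1": 0, "P2": 1}
med_vocab = {"M1": 0, "M2": 1, "M3": 2, "M4": 3}


def encode_visit(diag: Sequence[str], proc: Sequence[str], med: Sequence[str]) -> List[int]:
    """Concatenate the three multi-hot vectors into v = [d, p, m]."""
    return (
        multi_hot(diag, diag_vocab)
        + multi_hot(proc, proc_vocab)
        + multi_hot(med, med_vocab)
    )


# One visit with two diagnoses, one procedure, and two prescribed drugs.
visit = encode_visit(["D1", "D3"], ["P2"], ["M2", "M4"])
print(visit)
```

The resulting vector has length |D| + |P| + |M|, so every visit of every patient shares the same fixed-size layout regardless of how many codes were recorded.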

Known DDI relation matrix
In real-world clinical practice, it is common to encounter drug-drug interactions (DDIs). DDIs can yield both positive and negative outcomes. However, medication recommendation systems must prioritize patient safety by actively managing the DDI rate, particularly when the potential effects of such interactions remain uncertain. This precautionary approach ensures that the recommended drug combinations are as safe as possible. To achieve this, we utilize a symmetric matrix $D \in \{0, 1\}^{|M| \times |M|}$ as prior knowledge to describe the known DDI relation between pairs of drugs. $D_{ij} = 1$ indicates the presence of an interaction between drug i and drug j.
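The symmetric DDI relation matrix can be sketched in a few lines of Python. The drug indices and interaction pairs below are invented for illustration, and `build_ddi_matrix` is a hypothetical helper rather than part of any published code.

```python
from typing import Iterable, List, Tuple


def build_ddi_matrix(num_drugs: int, pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 matrix with ddi[i][j] = 1 for every known interacting pair."""
    ddi = [[0] * num_drugs for _ in range(num_drugs)]
    for i, j in pairs:
        ddi[i][j] = 1
        ddi[j][i] = 1  # interactions are undirected, so mirror the entry
    return ddi


# Hypothetical known interactions among four drugs.
known_pairs = [(0, 2), (1, 3)]
ddi = build_ddi_matrix(4, known_pairs)
```

Mirroring each entry keeps the matrix symmetric, so a lookup gives the same answer whichever drug of the pair is used as the row index.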

Medication combination recommendation problem
For a patient, given the current diagnoses $D^{(t)}$, procedures $P^{(t)}$, the historical EHR sequence $[v^{(1)}, v^{(2)}, \ldots, v^{(t-1)}]$, and the DDI relation matrix D, the medication combination recommendation task is to recommend an appropriate combination of medications $\hat{M}^{(t)}$ for the patient.

Methods
As illustrated in Fig. 2, DrugDoctor comprises three main components: (1) A visit-level representation module that captures the relationship between drugs' substructures and disease information of the current visit and generates an initial recommendation.
(2) A historical visits learning module that provides further prediction results by learning all available historical prescriptions and the most recent health condition of the patient after the last administration.
(3) A recommendation prediction module that calculates the final prediction results.

Input representations
We use three learnable embedding tables, $E_d \in \mathbb{R}^{|D| \times dim}$, $E_p \in \mathbb{R}^{|P| \times dim}$, and $E_m \in \mathbb{R}^{|M| \times dim}$ (corresponding to diagnosis, procedure, and medication, respectively), to project the corresponding multi-hot vectors into the corresponding embedding spaces. Here, dim is the embedding size. Specifically, each row of $E_d$ serves as a unique representation vector corresponding to a specific diagnosis code. For the tth visit, the diagnosis set is passed through the embedding table $E_d$ to obtain its representation $D_t$, a matrix with dim columns. Similarly, we obtain the representations of the procedure set and the medication set, $P_t$ and $M_t$.

Visit-level representation module
The professional knowledge of doctors is fundamental for prescribing medication to patients. Likewise, in the context of a computational model, it is essential for the model to effectively learn the relationship between diseases and medications. Here, we extract the intrinsic representation of disease information by analyzing the diagnosis and procedure information at the visit level. Usually, there exists a significant correlation between the codes within the diagnosis set (as well as the procedure set).
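To capture this intrinsic correlation and obtain more accurate representations, a Transformer-based [18] encoder is introduced to further represent the diagnosis information, as it can capture the dependencies between elements in the sequence. To be precise, the encoder consists of two major sub-layers, a multi-head self-attention layer MH(·, ·, ·) and a position-wise feed-forward layer FFN(·). Residual connections are used around each sub-layer, followed by layer normalization. Formally, the Transformer-based encoder can be defined as
$$H = \mathrm{LayerNorm}(X + \mathrm{MH}(X, X, X)), \qquad \mathrm{Encoder}(X) = \mathrm{LayerNorm}(H + \mathrm{FFN}(H)),$$
where X is the input (e.g. $D_t$). Following [18], MH(·, ·, ·) is defined as follows:
$$\mathrm{MH}(Q, K, V) = [\mathrm{head}_1; \ldots; \mathrm{head}_h]\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V),$$
where
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$
h is the number of heads, $d_k$ is the dimensionality of each head, and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are trainable projection matrices. FFN(·) is comprised of two linear layers separated by a ReLU activation:
$$\mathrm{FFN}(H) = \mathrm{ReLU}(H W_1^F + b_1^F)\, W_2^F + b_2^F,$$
where $W_1^F \in \mathbb{R}^{dim \times s}$, $b_1^F \in \mathbb{R}^{s}$, $W_2^F \in \mathbb{R}^{s \times dim}$, and $b_2^F \in \mathbb{R}^{dim}$ are the trainable parameters. According to [18], the inner layer has dimensionality s = 2048.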
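The core operation inside the multi-head self-attention layer is scaled dot-product attention as defined in [18]. The sketch below implements it in plain Python for a single head and, as a simplifying assumption, omits the learned projection matrices, residual connections, and layer normalization; the input values are arbitrary toy embeddings.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([
            sum(w * v_row[c] for w, v_row in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


# Self-attention over three toy code embeddings (Q = K = V).
codes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(codes, codes, codes)

# With identical keys the weights are uniform, so the output is the mean of the values.
uniform = attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
```

Using the same matrix as query, key, and value gives self-attention; passing a different query matrix (for example drug substructure embeddings) gives cross-attention with the same function.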
Given the diagnosis and procedure representations in the tth visit, each is fed into its corresponding encoder. Additionally, the biochemical activity of medications is typically linked to specific molecular substructures, as reported in [19]. Therefore, we believe that analyzing the relationship between the disease and drug substructures can be beneficial for medication recommendation. To obtain drugs' substructures from their molecules, the BRICS [20] method is adopted, which is accessible through the RDKit [21] package.
Since graph isomorphism networks (GIN) [22] have been widely used to learn molecule representations, we introduce a three-layer GIN to encode molecule substructures. A molecular substructure is given as a graph G = {V, E}, where V is a set of atoms (i.e. nodes) and E is a set of chemical bonds (i.e. edges). For atom v, the GIN encoder represents it by aggregating the features of its neighbors; the layer-wise propagation rule is described as follows:
$$b_v^{(k)} = \mathrm{MLP}^{(k)}\left((1 + \epsilon^{(k)})\, b_v^{(k-1)} + \sum_{u \in N(v)} b_u^{(k-1)}\right),$$
where MLP(·) is a multilayer perceptron, $\epsilon^{(k)}$ is a learnable parameter, $b_v^{(k)}$ is the representation of atom v at the kth layer, and N(v) is the set of neighbors of atom v. To obtain a global representation of a drug substructure, we adopt global mean-pooling on all atom representations. Finally, we collect the representations of all drug substructures into a table $E_{drug}$.
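The GIN update and the global mean-pooling step can be sketched as follows. This is an illustrative toy under stated assumptions: the MLP is replaced by a plain ReLU, $\epsilon$ is fixed rather than learned, and the three-atom path graph and its features are invented.

```python
from typing import Callable, Dict, List

Vector = List[float]


def gin_layer(
    features: Dict[int, Vector],
    neighbors: Dict[int, List[int]],
    eps: float,
    mlp: Callable[[Vector], Vector],
) -> Dict[int, Vector]:
    """One GIN update: b_v <- MLP((1 + eps) * b_v + sum of neighbor features)."""
    updated = {}
    for v, h_v in features.items():
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            agg = [a + b for a, b in zip(agg, features[u])]
        updated[v] = mlp(agg)
    return updated


def mean_pool(features: Dict[int, Vector]) -> Vector:
    """Global mean-pooling over all atom representations."""
    n = len(features)
    dim = len(next(iter(features.values())))
    return [sum(f[c] for f in features.values()) / n for c in range(dim)]


def relu(vec: Vector) -> Vector:
    """Stand-in for the MLP (assumption: a real model would use learned layers)."""
    return [max(0.0, x) for x in vec]


# Toy substructure: a path graph 0 - 1 - 2 with 2-dimensional atom features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

one_layer = gin_layer(features, neighbors, eps=0.0, mlp=relu)

# Three stacked layers followed by mean-pooling, mirroring the three-layer encoder.
hidden = features
for _ in range(3):
    hidden = gin_layer(hidden, neighbors, eps=0.0, mlp=relu)
substructure_repr = mean_pool(hidden)
```

Sum aggregation (rather than mean or max) is what distinguishes GIN from many other message-passing layers; the pooled vector would form one row of the substructure table.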
Moreover, to enhance the prediction performance, we developed a novel CA-MHSA block based on cross-attention and multi-head self-attention mechanisms to obtain disease information with specific awareness. The CA-MHSA block takes Q, D, and P as its inputs and uses the concatenation operation [·]; its structure is detailed in Fig. 2. When treating the drug substructure information $E_{drug}$ as a query, we aim at capturing the drug substructure-aware disease information in the current visit.
Then, we concatenate $E_t \in \mathbb{R}^{2dim}$, $D_t \in \mathbb{R}^{dim}$, and $P_t \in \mathbb{R}^{dim}$ into a more compact disease representation of the tth visit. Finally, a feed-forward neural network $FF_1(\cdot): \mathbb{R}^{4dim} \to \mathbb{R}^{dim}$ is applied to generate a preliminary drug recommendation result, where dim = |M|.

Historical visits learning module
For patients with a history of medical visits, DrugDoctor provides additional drug recommendations based on the following two aspects. Historical prescription information. In clinical practice, there is a strong correlation in the drug recommendations for the same patient. For instance, patients with chronic diseases often continue using the same medications over their lifetime. It is observed that a significant portion of the prescribed medicines in most visits are repetitive recommendations [16]. Hence, we use an RNN model to learn historical medication records and provide useful recommendations for the current visit.
For a specific patient, $m_h^{(t)}$ is the predicted result based on all historical medication records of this patient.
The effect of medication on condition. In general, a patient's current state of health is often influenced by the effects of previously prescribed medications on prior conditions. Accordingly, we argue that examining the influence of prior medications on the previous condition is helpful for the current drug predictions. Similarly, we use our proposed CA-MHSA block for this task.
To be specific, the prior medication information $M_{t-1}$ is considered as a query to explore its effect on the previous diagnosis information $D_{t-1}$ and procedure information $P_{t-1}$, which are encoded by the corresponding Transformer-based encoders. $E_{t-1}$ represents the effectiveness-aware medication information. Subsequently, we use a feed-forward network $FF_2(\cdot): \mathbb{R}^{3dim} \to \mathbb{R}^{dim}$ to predict the current drug combination based on the previous medications and their effectiveness.

Recommendation prediction module
After obtaining recommendations from various sources, we merge them to obtain a more comprehensive prescription recommendation $\hat{o}^{(t)}$, applying the sigmoid function σ(·) to the merged result. Every element of $\hat{o}^{(t)}$ denotes the appearance probability of the corresponding drug in the tth prescription. Then, by setting a threshold value δ, we obtain the recommended drug combination by selecting the entries with values greater than δ, represented as a multi-hot vector $\hat{m}^{(t)}$.
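The final thresholding step can be illustrated with a short Python sketch. It assumes the per-source recommendations have already been merged into one score per drug (how they are merged is not reproduced here), and the scores themselves are made up; only the sigmoid and the threshold δ = 0.4 come from the text.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def recommend(merged_scores: List[float], delta: float = 0.4) -> List[int]:
    """Turn merged scores into a multi-hot recommendation by thresholding at delta."""
    probs = [sigmoid(s) for s in merged_scores]
    return [1 if p > delta else 0 for p in probs]


# Hypothetical merged scores for four drugs.
scores = [2.0, -1.0, 0.0, -3.0]
recommended = recommend(scores)
print(recommended)
```

Lowering δ recommends more drugs (higher recall, typically more DDIs), while raising it yields smaller and more conservative combinations.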

Training and inference
Multiple loss. The recommendation task can be treated as a multi-label classification task. As a result, we select the binary cross-entropy loss as the loss function for the multi-label task:
$$L_{bce} = -\sum_{i=1}^{|M|} \left[ m_i^{(t)} \log \hat{o}_i^{(t)} + (1 - m_i^{(t)}) \log(1 - \hat{o}_i^{(t)}) \right],$$
where the subscript i denotes a drug. Additionally, to control the DDI rate of predicted drug combinations, in line with [14], we define the DDI loss as
$$L_{ddi} = \sum_{i=1}^{|M|} \sum_{j=1}^{|M|} D_{ij} \cdot \hat{o}_i^{(t)} \cdot \hat{o}_j^{(t)}.$$
In the end, the overall objective function is defined as the weighted combination of $L_{bce}$ and $L_{ddi}$, where α is a trade-off parameter to balance the prediction loss and the DDI loss.
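To make the two loss terms concrete, the sketch below computes a binary cross-entropy term and a pairwise DDI penalty for one visit in plain Python. Two points are assumptions rather than facts from the text: the BCE is summed (not averaged) over drugs, and the terms are combined as α·L_bce + (1 − α)·L_ddi, which is only one plausible reading of "weighted combination". The probabilities and the 2 × 2 DDI matrix are toy values.

```python
import math
from typing import List


def bce_loss(target: List[int], prob: List[float], tiny: float = 1e-12) -> float:
    """Binary cross-entropy summed over drugs (assumption: sum, not mean)."""
    return -sum(
        y * math.log(p + tiny) + (1 - y) * math.log(1.0 - p + tiny)
        for y, p in zip(target, prob)
    )


def ddi_loss(prob: List[float], ddi: List[List[int]]) -> float:
    """Pairwise penalty: sum over i, j of ddi[i][j] * prob[i] * prob[j]."""
    n = len(prob)
    return sum(ddi[i][j] * prob[i] * prob[j] for i in range(n) for j in range(n))


def total_loss(target: List[int], prob: List[float], ddi: List[List[int]], alpha: float = 0.5) -> float:
    """ASSUMED weighting: alpha * L_bce + (1 - alpha) * L_ddi."""
    return alpha * bce_loss(target, prob) + (1.0 - alpha) * ddi_loss(prob, ddi)


# Toy example: two drugs that interact, both predicted with probability 0.5.
target = [1, 0]
prob = [0.5, 0.5]
ddi = [[0, 1], [1, 0]]

loss = total_loss(target, prob, ddi)
```

Because the DDI term multiplies predicted probabilities, it is differentiable and penalizes the model most when it is confident about two drugs that are known to interact.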
Training strategy. To our knowledge, the training approaches of previous similar medication recommendation models take the whole information (including the entire medical history) of an individual patient as a training unit. Although these approaches could be effective for individuals with extensive medical histories, their performance may be less robust for new patients with limited medical records, making it challenging to extend known prescription outcomes from one patient to another in the cold-start scenario.
The limited generalization capability can be attributed to the fact that similar medical conditions might only present themselves during specific visits instead of persistently across a patient's overall health history. Each visit serves as a unique snapshot of a patient's health condition and corresponding treatment needs, and the efficiency of prescribed medications often varies across different visits. Therefore, relying solely on the overall health state of a patient may fail to accurately capture the specific requirements for medications during individual visits.
To overcome this challenge, it is crucial to develop models and approaches that can effectively capture specific features of each visit, taking into consideration the specific medical conditions and treatment requirements at a specific visit. In real-life situations, the order of visits of different patients is random and overlapping. We believe that training the model visit-by-visit is more in line with the practices of doctors.
Based on these insights, we have implemented a fundamentally new training strategy for our model. Given a set of patients, we generate a random sequence of all visits of the patients while preserving the chronological order of each patient's visits. In essence, the targeted sequence of visits is globally unordered, but the relative order of visits from an identical patient remains unchanged.
DrugDoctor is trained end-to-end in a visit-by-visit manner, and all the learnable parameters are optimized. During the inference phase, the model operates following the same pipeline as training. To obtain the recommended drug combination, the threshold value δ is set to 0.4 on the output drug representation $\hat{o}^{(t)}$ in Eqn. (13). By focusing on the visit-level information, it becomes possible to make more accurate and personalized medication recommendations, even for new patients with limited historical data.
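One way to realize the visit ordering described above (globally random, but chronological within each patient) is to shuffle one token per visit and then hand out each patient's visits in order. The sketch below is an illustration under that assumption, not the authors' code; the patient identifiers and visit labels are invented, and the default seed simply reuses the value 1023 reported in the experimental settings.

```python
import random
from typing import Dict, List, Tuple


def interleave_visits(patients: Dict[str, List[str]], seed: int = 1023) -> List[Tuple[str, int]]:
    """Return (patient_id, visit_index) pairs in a random global order that
    keeps every patient's own visits in chronological order."""
    rng = random.Random(seed)
    # One token per visit; shuffling tokens decides whose turn it is at each step.
    tokens = [pid for pid, visits in patients.items() for _ in visits]
    rng.shuffle(tokens)
    next_index = {pid: 0 for pid in patients}
    order = []
    for pid in tokens:
        order.append((pid, next_index[pid]))  # always the earliest unused visit
        next_index[pid] += 1
    return order


# Hypothetical cohort: three patients with 3, 1, and 2 visits.
patients = {"a": ["a0", "a1", "a2"], "b": ["b0"], "c": ["c0", "c1"]}
order = interleave_visits(patients)
for pid, idx in order:
    print(pid, patients[pid][idx])
```

Each element of `order` would correspond to one training step, with the patient's earlier visits serving as the available history for that step.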

Dataset and metrics
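To validate the effectiveness of DrugDoctor, we collected 15 032 hospital admission records of 6350 patients from MIMIC-III [23] and 23 525 hospital admission records of 9862 patients from MIMIC-IV [24]. In line with [14], we processed the patients' EHR data and presented the corresponding statistics in Table 1. Each dataset was divided into training, validation, and testing sets using a 4 : 1 : 1 ratio. To achieve visit-level training, each patient set was transformed into targeted visit sequences, which served as the inputs for the model. Four common metrics are employed to evaluate the performance: DDI rate, Jaccard Similarity Score (Jaccard), F1-score, and Precision-Recall Area Under Curve (PRAUC). The DDI rate measures the safety of predicted drug combinations, and the other three metrics are used for evaluating the recommendation efficacy. The detailed definitions of each metric are presented as follows:
$$\mathrm{DDI} = \frac{\sum_{t=1}^{N} \sum_{i,j} 1\{D_{\hat{M}_i^{(t)}, \hat{M}_j^{(t)}} = 1\}}{\sum_{t=1}^{N} \sum_{i,j} 1},$$
where N is the total number of visits of the patient and D is the known DDI relation matrix. $\hat{M}_i^{(t)}$ denotes the ith recommended drug in the tth visit. 1{·} is 1 when {·} is true and 0 otherwise. The Jaccard for the patient is calculated as follows:
$$\mathrm{Jaccard} = \frac{1}{N} \sum_{t=1}^{N} \frac{|M^{(t)} \cap \hat{M}^{(t)}|}{|M^{(t)} \cup \hat{M}^{(t)}|},$$
where $M^{(t)}$ is the ground-truth medication set in the tth visit and $\hat{M}^{(t)}$ is the predicted result. The F1 of the patient is calculated as follows:
$$\mathrm{Precision}_t = \frac{|M^{(t)} \cap \hat{M}^{(t)}|}{|\hat{M}^{(t)}|}, \qquad \mathrm{Recall}_t = \frac{|M^{(t)} \cap \hat{M}^{(t)}|}{|M^{(t)}|}, \qquad \mathrm{F1} = \frac{1}{N} \sum_{t=1}^{N} \frac{2 \times \mathrm{Precision}_t \times \mathrm{Recall}_t}{\mathrm{Precision}_t + \mathrm{Recall}_t}.$$
The PRAUC can be calculated as
$$\mathrm{PRAUC} = \frac{1}{N} \sum_{t=1}^{N} \sum_{k=1}^{|M|} \mathrm{Precision}(k)_t \, \Delta\mathrm{Recall}(k)_t,$$
where $\mathrm{Precision}(k)_t$ and $\Delta\mathrm{Recall}(k)_t$ are the precision and the change of recall at cut-off k in the ordered retrieval list, respectively.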
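The four metrics can be sketched for individual visits in plain Python; averaging the per-visit values over a patient's N visits then gives the patient-level scores. The function names are hypothetical, drugs are represented by integer indices, and the example sets and scores are invented. Counting unordered drug pairs in the DDI rate is a choice made here; it gives the same ratio as counting ordered pairs.

```python
from typing import List, Set


def jaccard(true_set: Set[int], pred_set: Set[int]) -> float:
    """|intersection| / |union| for one visit."""
    union = true_set | pred_set
    return len(true_set & pred_set) / len(union) if union else 0.0


def f1(true_set: Set[int], pred_set: Set[int]) -> float:
    """Harmonic mean of precision and recall for one visit."""
    hit = len(true_set & pred_set)
    if hit == 0:
        return 0.0
    precision = hit / len(pred_set)
    recall = hit / len(true_set)
    return 2 * precision * recall / (precision + recall)


def ddi_rate(pred_sets: List[Set[int]], ddi: List[List[int]]) -> float:
    """Fraction of recommended drug pairs (over all visits) with a known interaction."""
    total = 0
    interacting = 0
    for pred in pred_sets:
        drugs = sorted(pred)
        for a in range(len(drugs)):
            for b in range(a + 1, len(drugs)):
                total += 1
                interacting += ddi[drugs[a]][drugs[b]]
    return interacting / total if total else 0.0


def prauc(true_set: Set[int], scores: List[float]) -> float:
    """Sum over cut-offs k of precision(k) * change in recall(k) for one visit."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    area = 0.0
    for k, drug in enumerate(ranked, start=1):
        if drug in true_set:
            hits += 1
            area += (hits / k) * (1.0 / len(true_set))  # recall only changes on a hit
    return area


# Toy visit: ground truth {0, 1, 2}, prediction {1, 2, 3}; drugs 1 and 3 interact.
truth = {0, 1, 2}
pred = {1, 2, 3}
ddi = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]]

visit_jaccard = jaccard(truth, pred)
visit_f1 = f1(truth, pred)
visit_ddi = ddi_rate([pred], ddi)
visit_prauc = prauc({0, 1}, [0.9, 0.2, 0.8, 0.1])
```

Note that PRAUC is computed from the raw probabilities rather than the thresholded set, so it does not depend on the choice of δ.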

Experimental settings
We compare the proposed DrugDoctor with the following 10 baseline methods:
• LR is a standard logistic regression.
• LEAP [12] is an LSTM-based generation model. It regards recommendations as a sequential decision-making process based on diagnosis information.
• DMNC [26] utilizes a memory-augmented network with two controllers and a write-protected mechanism to conduct prediction.
• GAMENet [13] integrates the DDI graph through a graph-augmented memory module and utilizes longitudinal patient records for prediction.
• MICRON [27] conducts medication prediction by capturing changes in drugs between different visits using a recurrent residual network.
• SafeDrug [14] introduces molecule structure information to enhance medication recommendation.
• COGNet [16] proposes a copy-or-predict mechanism to generate the medication set.
• MoleRec [15] investigates the relationships between the health condition of patients and molecular substructures to improve the prediction.
• SHAPE [17] is a recent approach that proposes a sample adaptive hierarchical medication prediction network for this task.
Our model was implemented using PyTorch 1.9.1 with Python 3.8.18 and trained on an NVIDIA GeForce RTX 4090 GPU. The random seed was set to 1023 for reproducibility. It was trained using the Adam optimizer with a learning rate of $5 \times 10^{-4}$ and a batch size of 16. The hyperparameters of the model were selected based on their performance on the validation set. We set the threshold δ = 0.4 and the trade-off parameter α = 0.5. The RNN component was implemented using a Gated Recurrent Unit. The testing process was performed according to the previous work COGNet [16]. We randomly sampled 80% of the test data for each evaluation round, repeating this process 10 times. The mean and standard deviation of these 10 rounds were calculated and reported as the final outcome. All the baselines in our study were implemented using the optimized parameters as described in the respective references.
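The repeated-subsampling test protocol (80% of the test data, 10 rounds, mean and standard deviation) can be sketched as below. Sampling without replacement and the use of the sample standard deviation are assumptions, since the text does not specify them; the per-visit scores are invented and `repeated_subsample_eval` is a hypothetical helper.

```python
import random
import statistics
from typing import List, Tuple


def repeated_subsample_eval(
    per_visit_scores: List[float],
    rounds: int = 10,
    fraction: float = 0.8,
    seed: int = 1023,
) -> Tuple[float, float]:
    """Average a metric over random subsets of the test set and report
    the mean and standard deviation across rounds."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(per_visit_scores)))
    round_means = []
    for _ in range(rounds):
        sample = rng.sample(per_visit_scores, k)  # assumption: without replacement
        round_means.append(statistics.mean(sample))
    return statistics.mean(round_means), statistics.stdev(round_means)


# Hypothetical per-visit Jaccard scores on a small test set.
scores = [0.42, 0.55, 0.61, 0.48, 0.50, 0.58, 0.47, 0.53, 0.60, 0.45]
mean_score, std_score = repeated_subsample_eval(scores)
print(f"{mean_score:.4f} +/- {std_score:.4f}")
```

Reporting the spread across rounds gives a rough sense of how sensitive a score is to the particular test visits that were drawn.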

Performance comparison
Table 2 presents a comprehensive summary of the prediction performance of all methods on the widely used MIMIC-III dataset. All in all, our proposed DrugDoctor outperforms all baselines with higher Jaccard, F1, and PRAUC scores while maintaining a relatively lower DDI rate. Among the baselines, LR, ECC, and LEAP exhibit poor prediction performance since they only consider patient information within the current visit. On the other hand, the longitudinal-based methods, which consider the patient's medical history, achieve relatively better performance. MICRON achieves improved performance and a low DDI rate by predicting fewer drugs. SafeDrug and MoleRec both combine molecular representations of drugs and employ specific DDI control strategies to adaptively balance the accuracy and safety of predicted medications.
In particular, COGNet and SHAPE make efforts to learn visit-level knowledge and achieve good results. However, it is worth noting that the implementations of these two models still essentially learn at the patient level, which means their performance may vary depending on the number of patient visits. To demonstrate the effectiveness of visit-level training, we conducted additional comparative experiments of DrugDoctor with COGNet and SHAPE on the MIMIC-IV dataset. The results presented in Table 3 indicate that DrugDoctor possesses enhanced generalization capability compared with COGNet and SHAPE, both of which are representative methods adopting patient-level training.
Additionally, to further investigate the robustness of COGNet, SHAPE, and DrugDoctor, we conducted experiments on patients with varying numbers of hospital visits. More specifically, we selected patients with specific numbers of hospital visits, and conducted experiments on the selected patients to assess the models' performances under different scenarios. Since the majority of patients in the MIMIC-III dataset have fewer than five visits (see Fig. 3 (A)), we focused on the first five visits of patients in the dataset, and tested the models' performances on the patients with exactly n visits for n = 1, ..., 5 (Fig. 4). For the MIMIC-IV dataset, most patients have fewer than seven visits (see Fig. 3 (B)). Therefore, we tested the models on the patients in the dataset with exactly n visits for n = 1, ..., 7, and the experimental results are shown in Fig. 5. Both Fig. 4 and Fig. 5 illustrate that DrugDoctor exhibits remarkable robustness when the number of visits varies. There are two important observations. Firstly, DrugDoctor outperforms COGNet and SHAPE significantly, particularly when dealing with new patients. This indicates that DrugDoctor effectively addresses the cold-start problem. Secondly, all models achieve their highest performance in Fig. 4 when the number of visits is two, which may be attributed to the fact that the proportion of patients with two visits in MIMIC-III is much higher than that of the others.
In summary, these findings not only validate the effectiveness of our model, but also highlight its potential in various practical scenarios.

Ablation study
To verify the effectiveness of each component of DrugDoctor, we design the following ablation models:
• DrugDoctor w/o Block1: by removing CA-MHSA Block1, the diagnosis information and procedure information are directly combined without the guidance of drug substructure information.
• DrugDoctor w/o Block2: by removing CA-MHSA Block2, the medication records from previous visits are used without considering the impact of medications on the patient's condition.
• DrugDoctor w/o DDI: by removing $L_{ddi}$ in Eq. 16 from the loss function, the model is trained only by the multi-label classification loss.
• DrugDoctor w/o RNN: by removing the RNN module, the model ignores the medication records of earlier visits.
Table 4 shows the performance of the different variants of DrugDoctor. As expected, the results of DrugDoctor w/o Block1 indicate that the relationship between disease and the function of drug substructures is beneficial, and that the CA-MHSA Block1 could effectively explore the drug substructure-aware disease information in visit-level representation. Similarly, the results of DrugDoctor w/o Block2 suggest that investigating the influence of prior medications on the previous condition is helpful for the current drug prediction. DrugDoctor w/o RNN yields poor results among all ablation models, highlighting the necessity of incorporating historical medication records. Additionally, the results of DrugDoctor w/o DDI demonstrate that the combination loss function could effectively balance the accuracy and safety of predicted medication combinations, which is consistent with the fact that the DDI rate typically increases with increasing accuracy.

Case study
To intuitively illustrate the advantages of DrugDoctor over COGNet and SHAPE, we randomly sampled several example cases and analyzed their predicted results. Due to space limitations, diagnostic and procedural data are abbreviated using International Classification of Diseases (ICD) codes, while medications are encoded via the Anatomical Therapeutic Chemical (ATC) classification system. Table 5 is about patient A, who was randomly sampled from the patients with two visits in the MIMIC-III test set. In Table 5, "hit" and "error" refer to the numbers of medications that are correctly and wrongly recommended, respectively, and "missed" denotes the number of medications present in the ground truth prescription that have not been recommended. As indicated in Table 5, all three models achieved good performance at the second visit of the patient. However, the recommended results for the initial visit clearly demonstrate the distinct advantage of DrugDoctor. Our model obtained more accurate recommendations compared with the other models, with 23 "hit"s and the lowest numbers of "error"s and "missed"s. To validate the capabilities of DrugDoctor in more complex clinical scenarios, we randomly sampled a patient (called patient B) with four hospital visits from the MIMIC-IV test set. The recommended results for patient B by the models are presented in Fig. 6. DrugDoctor achieved the best performance in the first, second, and fourth visits.
For patient A, who had only two hospital visits, the recommendation results highlight that DrugDoctor excels in dealing with new patients during their initial visit, successfully generalizing the learned prescription knowledge to these new cases. In the scenario of patient B, who had four visits and came from a different dataset, DrugDoctor maintained better or competitive performance compared with the other models. Overall, the results of the case studies on these two patients from different datasets further demonstrate the superiority and generalization capabilities of DrugDoctor compared with the representative benchmark methods.

Discussion and conclusion
Recent advancements in healthcare technologies and the growing availability of patient data have facilitated the development of intelligent medication recommendation systems. This research paper proposes DrugDoctor, a novel drug recommendation model that emulates the decision-making process of human doctors. Unlike previous approaches, we introduce the CA-MHSA block, which explores both drug substructure-aware disease information and effectiveness-aware medication information. Additionally, we achieved fine-grained dataset segmentation, enabling model training at the visit level for the first time. Extensive experiments across various datasets confirmed the superior performance of DrugDoctor compared with other methods and its robustness in handling patients with different numbers of visits.
Importantly, DrugDoctor effectively addresses the prevalent cold-start problem encountered in medication combination recommendations. The effectiveness of DrugDoctor can be attributed to two main reasons. Firstly, both drug substructure-aware disease information and effectiveness-aware medication information extracted by CA-MHSA blocks are beneficial for improving prediction. Secondly, the novel visit-level training strategy enhances the data mining performance on EHR data, as evidenced by the comprehensive comparison with COGNet and SHAPE, which are two representative patient-level training methods.
Despite these advancements, implementing intelligent medication recommendation systems still presents several challenges. Ensuring data privacy and security is paramount, given the sensitive nature of patient information. Furthermore, integrating medication recommendation systems with existing EHR-based platforms requires ensuring data quality and consistency, as EHR data from various sources may differ in format and reliability. Advanced computational infrastructure is also necessary to handle and analyze large volumes of heterogeneous data in real time. Lastly, enhancing the system's interpretability and transparency is crucial for gaining healthcare providers' trust and facilitating effective use in clinical settings.
In conclusion, the development of intelligent medication recommendation systems represents a significant step toward providing personalized and accurate healthcare. However, there are still important areas for further research. Current efforts often focus solely on minimizing DDIs, but it is also essential to recognize that some DDIs can be beneficial, and not all interactions are harmful. Incorporating more comprehensive medication knowledge can help develop rational DDI control strategies that optimize both efficacy and safety. Additionally, recent studies [28,29] suggest that leveraging structural information from heterogeneous healthcare networks could enhance prediction accuracy. Thus, using heterogeneous graph neural networks to process EHR data might improve recommendation outcomes by capturing rich structured information with fewer data consistency constraints.

Key Points
• DrugDoctor is the first drug recommendation model to achieve visit-level representation learning and training, which is more in line with the practices of doctors.
• A plug-and-play CA-MHSA block is proposed to capture the drug substructure-aware disease information and the effectiveness-aware medication information, significantly improving the prediction performance.
• DrugDoctor exhibits a distinct advantage when dealing with the first-time visits of new patients in the cold-start scenario.
• The experimental results on the benchmark datasets validate the effectiveness and robustness of our proposed method, demonstrating the superiority of DrugDoctor over the state-of-the-art baselines.

Figure 2. An overview of DrugDoctor; (A) visit-level representation module; it provides the preliminary recommendation $\hat{m}_c^{(t)}$ by exploring the disease-medication associations in the current visit; (B) historical visits learning module; it takes all historical prescriptions and the nearest historical diagnosis and procedure information as input and generates the recommendations based on historical visits, $m_h^{(t)}$ and $m_p^{(t)}$; in the end, the final prediction results $\hat{m}^{(t)}$ are obtained; the details of our proposed CA-MHSA block are also presented in the picture.

Figure 3. (A) The histogram of hospital visits per patient in the MIMIC-III dataset; (B) the histogram of hospital visits per patient in the MIMIC-IV dataset.

Figure 4. The performance of three models on the patients with specific numbers of visits in the MIMIC-III dataset.

Figure 5. The performance of three models on the patients with specific numbers of visits in the MIMIC-IV dataset.

Figure 6. Recommended medications for patient B, randomly sampled from the MIMIC-IV dataset, where "hit" and "error" refer to the numbers of medications that are accurately and wrongly recommended, respectively; "missed" denotes the number of medications present in the ground truth prescription that have not been recommended.

Table 2. Performance comparison on the MIMIC-III dataset; the best results are highlighted in bold.

Table 3. Performance comparison on the MIMIC-IV dataset; the best results are highlighted in bold.

Table 4. Ablation study of DrugDoctor on the MIMIC-III dataset.

Table 5. Recommended medications for patient A with two visits.